This is the ‘Open Data Science’ course. I am familiar with R and RStudio. I have used Git before, but I am not confident enough to use it independently. I am interested in learning about data science and how statistics can be conveyed to non-statisticians. I was looking for a course not necessarily in statistics, but one covering data science applicable to any field, along with some simple ways of learning and using GitHub. The reviews from earlier students of this course encouraged me to enroll. My GitHub repository for this course: https://github.com/Ashwini-Helsinki/IODS-project
The first step is to load the required libraries (R packages) for this exercise.
# Required libraries
library(GGally)
library(ggplot2)
The data created in the data wrangling exercise is read in as ‘data2’ here. The data describe 166 students; each row corresponds to one student, with 7 columns of information such as age, gender, the student’s learning approaches (deep, surface and strategic learning) and attitude towards statistics. The column ‘Points’ gives each student’s exam points.
setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data2 <- read.csv("data/learning2014.csv",header = TRUE)
# Dimension of data: rows and columns respectively.
dim(data2)
## [1] 166 7
# Structure of data
str(data2)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
A summary of each variable in the analysis data is shown below. For each numeric variable, the summary shows the minimum, first quartile, median, mean, third quartile and maximum.
summary(data2)
## gender Age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Let’s examine in detail how each of the variables in the data is distributed and how the variables relate to each other.
# create graphical overview of variables with ggpairs()
p <- ggpairs(data2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The graph above shows the distribution of each variable in the data and its correlation with the other variables. It is a plot matrix with 7 rows and 7 columns, where each row and each column represents one variable from the data.
For every variable except ‘gender’, a density plot is shown in the diagonal position. Below the diagonal, each panel shows a scatter plot of the joint observations of the row and column variables; the corresponding panel above the diagonal shows their correlation.
Since the gender variable takes only the two values ‘F’ and ‘M’, the plot shows a bar chart for it on the diagonal. Its relation to the other variables is shown by per-gender histograms (in the first column) and per-gender box plots (in the first row) of those variables.
From the graph, it can be seen that there are more than 100 female students and around 50 male students. Looking at the density plot of ‘Age’, most of the students are under 30. The long right tail suggests a few students above 30 and up to 60 years of age. The density of ‘attitude’ looks roughly normal, spanning 1 to 5 with a mode around 3. The ‘deep’ learning approach has a long left tail. Strategic learning (‘stra’) and surface learning (‘surf’) both show slightly bimodal densities. ‘Points’ has a mode around 22 and a small cluster of observations near 11.
From the correlations, it can be seen that attitude towards statistics and exam points are positively correlated: the better the attitude, the higher the exam points. Surface learning is negatively correlated with all the other variables.
How these variables differ between the two ‘gender’ groups is shown in the following graph. The transparency of the overlapping plots is set with the alpha argument.
# create graphical overview of variables with ggpairs()
p1 <- ggpairs(data2, mapping = aes(col= gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p1
Let’s fit a linear regression model to study the effect of the three covariates ‘attitude’, ‘deep’ and ‘stra’ on exam points.
my_model1 <- lm(Points ~ attitude + deep + stra, data= data2)
summary(my_model1)
##
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
The ‘t value’ in the summary table is the test statistic for the statistical test of the parameter (coefficient) of each covariate (explanatory variable). It tests whether the coefficient is zero or not. If the coefficient is away from zero, the corresponding explanatory variable has an effect on the response (target) variable. If the standardized test statistic (t value) is large in magnitude (positive or negative), the corresponding p-value is small, indicating statistical significance of the covariate.
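As a minimal sketch on simulated toy data (not the course data), one can verify that the t value reported by summary() is simply the estimate divided by its standard error:

```r
# Toy illustration (simulated data): t value = Estimate / Std. Error
set.seed(42)
toy <- data.frame(x = rnorm(30))
toy$y <- 2 + 1.5 * toy$x + rnorm(30)
fit <- lm(y ~ x, data = toy)
ctab <- summary(fit)$coefficients
all.equal(unname(ctab[, "t value"]),
          unname(ctab[, "Estimate"] / ctab[, "Std. Error"]))  # TRUE
```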
The model summary shows that ‘attitude’ and ‘stra’ have a statistically significant relationship with the target variable ‘Points’, since their p-values are small. The p-value for the covariate ‘attitude’ is less than 0.001; its coefficient estimate is 3.5254 with standard error 0.5683, so an increase of 1 unit in attitude increases the predicted exam points by 3.5254 units.
The p-value for ‘stra’ is less than 0.1. Its coefficient estimate of 0.9621 indicates a 0.9621-unit increase in predicted exam points when ‘stra’ increases by 1 unit.
The variable ‘deep’ does not show a significant relationship with ‘Points’ (p-value > 0.3); hence it is removed from the model.
The intercept term has a large estimate of 11.3915. Its p-value is also very small, which may indicate that some important covariates are missing from the model.
However, no new covariate will be added at this stage. As per the instructions in the exercise, we fit the new model by removing ‘deep’.
my_model2 <- lm(Points ~ attitude + stra , data= data2)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ attitude + stra, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
The summary of the new model is shown above. The relationships of ‘attitude’ and ‘stra’ with the target have not changed much compared to the previous model. The model can be written as
Points = 8.9729 + (3.4658 * attitude) + (0.9137 * stra) + (Error term)
The error term is assumed to be normally distributed. The multiple R-squared is 0.2048, meaning the model explains only about 20% of the variation in the response variable ‘Points’.
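As a sketch on simulated data (toy coefficients, not the fitted values above), such a written-out equation can be checked against predict(): the manual linear combination of the estimated coefficients reproduces the fitted values.

```r
# Sketch: fitted values equal intercept + b1*attitude + b2*stra (toy data)
set.seed(7)
toy <- data.frame(attitude = runif(40, 1, 5), stra = runif(40, 1, 5))
toy$Points <- 9 + 3.5 * toy$attitude + 0.9 * toy$stra + rnorm(40, sd = 5)
fit <- lm(Points ~ attitude + stra, data = toy)
b <- coef(fit)
manual <- b["(Intercept)"] + b["attitude"] * toy$attitude + b["stra"] * toy$stra
all.equal(unname(manual), unname(predict(fit)))  # TRUE
```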
To study model fitting in more detail, let’s examine various plots of the models.
par(mfrow = c(1,3))
plot(my_model2, which=c(1,2,5))
The Residuals vs Fitted plot shows no specific pattern that would indicate a relationship between the fitted values and the residuals. The red line deviates a little from horizontal, but the deviation is not large. Hence the size of the errors does not appear to depend on the explanatory variables, as assumed in linear regression.
The second assumption in linear regression is that the error term is normally distributed. The Normal Q-Q plot shows that most of the observations lie close to the straight line, but some (numbers 35, 56 and 145) and a few at the top deviate slightly from it. The points in the middle region support the assumption of normally distributed errors.
From the Residuals vs Leverage plot, observations 35, 71 and 145 have extreme leverage compared to the other points. Still, none of them falls outside Cook’s distance, so no single observation has an undue impact on the model.
The first step is to load the required libraries (R packages) for this exercise.
# Required libraries
library(dplyr)
library(boot)
The data created in the data wrangling exercise is read in as ‘data3’ here. This data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. The dataset includes performance in two distinct subjects, Mathematics (mat) and Portuguese language (por): the ‘.math’ and ‘.por’ suffixes mark grades in the Maths and Portuguese courses, while G1, G2 and G3 show the student’s average grades over the two subjects. G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.
setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data3 <- read.csv("data/Alcdata.csv",header = TRUE)
# Dimension of data: rows and columns respectively.
dim(data3)
## [1] 370 47
# variable names of data
colnames(data3)
## [1] "school" "sex" "age" "address"
## [5] "famsize" "Pstatus" "Medu" "Fedu"
## [9] "Mjob" "Fjob" "reason" "guardian"
## [13] "traveltime" "studytime" "failures.math" "schoolsup"
## [17] "famsup" "paid.math" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel"
## [25] "freetime" "goout" "Dalc" "Walc"
## [29] "health" "absences.math" "G1.math" "G2.math"
## [33] "G3.math" "failures.por" "paid.por" "absences.por"
## [37] "G1.por" "G2.por" "G3.por" "failures"
## [41] "paid" "absences" "G1" "G2"
## [45] "G3" "alc_use" "high_use"
This data is used in this analysis to study the relationships between high/low alcohol consumption and some of the other variables in the data.
Let’s choose four variables, namely ‘sex’, ‘famrel’, ‘absences’ and ‘failures’. My hypotheses about their relationships with high alcohol consumption are as follows:
1. ‘sex’ may not have a direct relationship with high alcohol consumption.
2. ‘famrel’, i.e. good family relationships, has a negative impact on high alcohol consumption. This is a categorical variable: the better the family relations, the less the alcohol consumption.
3. ‘absences’ and ‘failures’ have a positive impact on high alcohol consumption: the more absences and failures, the more alcohol consumption.
Let’s observe the relationship of these 4 variables with high alcohol consumption.
# 2x2table
table(data3$sex, data3$high_use)
##
## FALSE TRUE
## F 154 41
## M 105 70
# bar graph
ggplot(data3, aes(x=sex, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)))+
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies")
Both the tabular view and the graph show that the male group ‘M’ has a higher percentage of high alcohol consumption than the female group ‘F’. This suggests that my hypothesis about the relationship of ‘sex’ and high/low alcohol consumption is not true.
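To quantify that difference, row-wise proportions can be computed with prop.table(); a sketch using the same counts as in the table above:

```r
# Share of high alcohol use within each sex (counts from the table above)
tab <- matrix(c(154, 105, 41, 70), nrow = 2,
              dimnames = list(sex = c("F", "M"), high_use = c("FALSE", "TRUE")))
round(prop.table(tab, margin = 1), 2)  # F: 0.79 / 0.21,  M: 0.60 / 0.40
```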
Let’s look at the tabular and graphical representation for ‘famrel’.
# tabular representation
table(data3$famrel, data3$high_use)
##
## FALSE TRUE
## 1 6 2
## 2 9 9
## 3 39 25
## 4 128 52
## 5 77 23
# bar graph
ggplot(data3, aes(x=famrel, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)), position=position_dodge())+
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies")
Both the table and the graph show that in famrel categories 1 to 3 the relative frequency of high alcohol consumption is substantial, whereas categories 4 and 5, which indicate better family relationships, have a lower percentage of students with high alcohol consumption. Here my hypothesis is somewhat true.
# tabular representation
table(data3$absences , data3$high_use)
##
## FALSE TRUE
## 0 50 13
## 1 37 13
## 2 40 16
## 3 31 7
## 4 24 11
## 5 16 6
## 6 16 5
## 7 9 3
## 8 14 6
## 9 5 6
## 10 5 2
## 11 2 3
## 12 3 4
## 13 1 1
## 14 1 6
## 16 0 1
## 17 0 1
## 18 1 1
## 19 0 1
## 20 2 0
## 21 1 1
## 26 0 1
## 27 0 1
## 29 0 1
## 44 0 1
## 45 1 0
# box plot
ggplot(data3, aes(x=high_use , y=absences))+ geom_boxplot()
Looking at the box plot, it can be seen that students with high alcohol consumption have more absences. The interquartile ranges of the two boxes are similar, but the box and whisker for the ‘TRUE’ category extend much further, and its third quartile lies well above the median. This indicates that absences and high alcohol consumption are positively related. My hypothesis is true to an extent.
# tabular representation
table(data3$failures , data3$high_use)
##
## FALSE TRUE
## 0 238 87
## 1 12 12
## 2 8 9
## 3 1 3
# bar graph
ggplot(data3, aes(x=failures , fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)),position=position_dodge())+
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies")
Here the share of high alcohol consumption increases with the number of failures. My hypothesis is true.
Let’s fit a logistic regression model to study the effect of these 4 covariates (sex, famrel, absences and failures) on high/low alcohol consumption. Though ‘famrel’ and ‘failures’ could be treated as categorical variables, the data are not evenly distributed over their categories, so choosing a base category is difficult; let’s keep them as continuous variables. ‘sex’ is a binary variable, so one of its categories is treated as the base and the other is estimated in comparison to it.
# Fit logistic regression
my_model1 <- glm(high_use ~ sex + famrel + absences + failures , data= data3, family =binomial(link = "logit"))
# summary of the model
summary(my_model1)
##
## Call:
## glm(formula = high_use ~ sex + famrel + absences + failures,
## family = binomial(link = "logit"), data = data3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0786 -0.8216 -0.5746 0.9760 2.1820
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.79406 0.54281 -1.463 0.14350
## sexM 1.04800 0.25091 4.177 2.96e-05 ***
## famrel -0.29791 0.13044 -2.284 0.02238 *
## absences 0.08941 0.02274 3.932 8.43e-05 ***
## failures 0.57328 0.20531 2.792 0.00523 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 401.77 on 365 degrees of freedom
## AIC: 411.77
##
## Number of Fisher Scoring iterations: 4
# Estimate and confidence interval
cbind(coef(my_model1), confint(my_model1))
## 2.5 % 97.5 %
## (Intercept) -0.79406437 -1.87558320 0.26100198
## sexM 1.04800182 0.56283743 1.54857812
## famrel -0.29791173 -0.55591186 -0.04251048
## absences 0.08940969 0.04695413 0.13651044
## failures 0.57327802 0.17701334 0.98884313
The ‘sexM’ row gives the coefficient for category ‘M’, with the base category ‘F’ absorbed into the intercept term. All 4 covariates have p-values smaller than 0.05, so all of them show a statistically significant relation with high alcohol consumption. Also, all the coefficients are away from zero and the 95% confidence intervals exclude zero for all 4 covariates. To view these effects in terms of odds ratios, let’s take the exponential of the coefficients and confidence intervals.
# Odds ratios by exponentiating coefficients of the model
print("Odds ratios - estimate and confidence interval")
## [1] "Odds ratios - estimate and confidence interval"
cbind(exp(coef(my_model1)), exp(confint(my_model1)))
## 2.5 % 97.5 %
## (Intercept) 0.4520039 0.1532656 1.2982302
## sexM 2.8519467 1.7556470 4.7047758
## famrel 0.7423669 0.5735490 0.9583804
## absences 1.0935286 1.0480739 1.1462668
## failures 1.7740730 1.1936470 2.6881229
Although ‘absences’ has an odds ratio of 1.0935, which is not far from 1, its confidence interval still excludes 1. Hence for all 4 odds ratios the confidence intervals exclude 1, indicating a statistical relation with the target variable high_use. Comparing this result with the initial hypotheses: ‘sex’ does have an effect on high_use, so that hypothesis is not true. ‘famrel’ has a negative effect on high_use, since its odds ratio and 95% interval lie below 1. ‘absences’ has a positive effect on high_use, and ‘failures’ clearly has a relation with high_use. So for these 3 covariates the initial hypotheses are true.
The AIC of this model is 411.77. An AIC value is informative only in comparison with the AIC of another model fitted to the same data; the model with the lower AIC is considered the better fit.
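As a sketch on simulated data (the variable names here are hypothetical), AIC() can compare several fitted models at once:

```r
# Sketch: comparing the AIC of two nested logistic models on toy data
set.seed(3)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- rbinom(200, 1, plogis(-0.5 + 1.2 * toy$x1))
m_full  <- glm(y ~ x1 + x2, data = toy, family = binomial)
m_small <- glm(y ~ x1, data = toy, family = binomial)
AIC(m_full, m_small)  # the model with the lower AIC is the better fit
```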
The next step is prediction using the model my_model1. The predict() function gives a probability for each row in the dataset. Probabilities greater than 0.5 are treated as ‘TRUE’ predictions (high_use = TRUE or 1) and the others as ‘FALSE’ (high_use = FALSE or 0).
# predict() the probability of high_use
probabilities <- predict(my_model1, type = "response")
# add the predicted probabilities to the data to get a new dataset 'alc'
alc <- mutate(data3, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
The next table and graph show the prediction accuracy in terms of the observed values of high_use versus the predictions.
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 15
## TRUE 77 34
print("In terms of proportions and with margins")
## [1] "In terms of proportions and with margins"
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65945946 0.04054054 0.70000000
## TRUE 0.20810811 0.09189189 0.30000000
## Sum 0.86756757 0.13243243 1.00000000
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
To compute the prediction error as the average proportion of wrong predictions, let’s define a loss function as given in DataCamp.
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486
The average proportion of wrong predictions (training error) with the model my_model1 is approximately 0.25.
Let’s perform 10-fold cross validation for ‘alc’ data, model my_model1 and the loss function defined above.
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model1, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2540541
With 10-fold cross-validation, the average proportion of wrong predictions is about 0.25 (the exact value is not quoted here: since no random seed is set, each run of the cv function gives a slightly different value, but all of them are close to 0.25), which is larger than the training error. This is expected, since the training error is computed on the same observations that were used to fit the model, while the CV error comes from predicting observations held out of the fit. The training error reflects only estimation error; the uncertainty due to new, unseen data is absent. Hence the CV error is typically larger than the training error.
Looking at the CV error, one can say that the performance of this model is similar to that of the model given in DataCamp.
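A fixed seed makes the fold assignment, and hence the CV error, reproducible; a sketch on simulated data:

```r
# Sketch: fixing the seed before cv.glm() makes the CV error reproducible
library(boot)
set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(toy$x))
fit <- glm(y ~ x, data = toy, family = binomial)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
set.seed(123)
cv1 <- cv.glm(data = toy, cost = loss_func, glmfit = fit, K = 10)$delta[1]
set.seed(123)
cv2 <- cv.glm(data = toy, cost = loss_func, glmfit = fit, K = 10)$delta[1]
identical(cv1, cv2)  # TRUE: same seed, same folds, same error
```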
Next, let’s start with a large number of predictors and remove one predictor at each step. In each step, the covariate whose removal improves the model criterion the most is dropped, and the process continues until no further improvement is achieved by removing a covariate. This can be performed with the step() function on the full model.
# formula using 26 covariates in the data is created.
BiggerFormula <- as.formula(high_use ~sex + age+famsize+Pstatus+Medu+Fedu+Mjob+
Fjob+guardian+traveltime+studytime+schoolsup+famsup+activities+nursery+higher+
internet+romantic+famrel+freetime+goout+health+failures+paid+absences+G3)
# glm is fitted to the bigger formula.
FullModel <- glm(BiggerFormula,family=binomial(link="logit"),data=data3)
# For loop to reduce 1 covariate in every step.
finalDF <- c() # to save final result
numcov = 26 # initial number of covariates
for( mm in 1:26){
prevnumcov= numcov
# perform stepwise backward elimination with steps = mm
Newmodel <- step(FullModel,direction="backward",trace=FALSE, steps =mm)
numcov=length(coef(Newmodel))-1
if(numcov == prevnumcov){
print(paste0("No further improvement after ",prevnumcov, " variables." ))
break
}
# predict() the probability of high_use
probabilities <- predict(Newmodel, type = "response")
# add the predicted probabilities to get a new dataset 'alc1'
alc1 <- mutate(alc, probability1 = probabilities)
# use the probabilities to make a prediction of high_use
alc1 <- mutate(alc1, prediction1 = probability1 > 0.5)
trainerror = loss_func(class = alc1$high_use, prob = alc1$probability1)
cv <- cv.glm(data = alc1, cost = loss_func, glmfit = Newmodel, K = 10)
# average number of wrong predictions in the cross validation
newcv = cv$delta[1]
finalDF <- rbind(finalDF, c(numcov,trainerror,'Training'),c(numcov,newcv,'Prediction'))
}
## [1] "No further improvement after 9 variables."
finalDF = data.frame(finalDF)
colnames(finalDF) <- c('Number of covariates', 'Error', 'Type')
ggplot(finalDF, aes(x= `Number of covariates`, y=as.numeric(Error), col=Type)) +
geom_point() + scale_y_continuous(labels = scales::number_format(accuracy = 0.01))
For models with a large number of covariates, the training error is smaller but the prediction error is large. Models with a well-chosen smaller set of covariates have a training error that is not much different, but a clearly lower prediction error.
The first step is to load the required libraries (R packages) for this exercise. Also, set a random seed for reproducibility of the results.
# Required libraries
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
set.seed(12345)
The ‘Boston’ data from the ‘MASS’ package is used here. The data concern housing values, and factors affecting them, in the suburbs of Boston. The data has 506 rows and 14 columns; each row describes one suburb through 14 aspects such as residential land, business area, pupil-teacher ratio and taxes. Let’s explore the data in more detail.
# load the data
data("Boston")
# Structure of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Dimension of the dataset
dim(Boston)
## [1] 506 14
Summaries and distributions of the variables.
# Summary the dataset
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# distribution of variables and their relationships with each other
pairs(Boston)
# Better graphical presentation with ggpairs()
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
The distributions of all the variables can be seen in the plot above. ‘rm’, i.e. the average number of rooms per dwelling, has a nice bell-shaped curve suggesting a normal distribution with mean around 6.3. The variables ‘dis’, ‘lstat’ and ‘medv’ have roughly normal shapes with longer right tails. ‘indus’ and ‘tax’ seem bimodal.
To explore the relationships between the variables, let’s use ‘corrplot’.
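The chunk that produced the correlation matrix and the plot is not shown above; it was likely similar to the following sketch (the exact corrplot() arguments are a guess):

```r
# Correlation matrix of the Boston data, rounded to two decimals
library(MASS)      # Boston data
library(corrplot)  # correlation plot
data("Boston")
cor_matrix <- round(cor(Boston), 2)
cor_matrix
# visualize the correlations in the upper triangle, labels on the diagonal
corrplot(cor_matrix, method = "circle", type = "upper",
         cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
```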
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
‘medv’ (median value of owner-occupied homes in $1000s) shows correlation with all the other variables. Except for ‘chas’ (the binary ‘tract bounds river’ variable), all variables have a fair amount of correlation (positive or negative) with each other.
From the colourful corrplot, it can be seen that the highest positive correlation is between ‘tax’ (full-value property-tax rate per $10,000) and ‘rad’ (index of accessibility to radial highways).
‘dis’ (weighted mean of distances to five Boston employment centres) has strong negative correlations with ‘indus’ (proportion of non-retail business acres per town), ‘age’ (proportion of owner-occupied units built prior to 1940) and ‘nox’ (nitrogen oxides concentration, parts per 10 million), indicating that near the employment centres there is a higher proportion of non-retail business, a higher proportion of owner-occupied units built prior to 1940, and a higher nitrogen oxides concentration. As expected, ‘lstat’ (percentage of lower-status population) and ‘medv’ (median value of owner-occupied homes) are also highly negatively correlated.
‘chas’ has hardly any correlation with the other variables. Although people historically chose to settle near rivers and water sources, today proximity to a river seems to have little impact on the choice of housing.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
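The standardization behind these summaries is (x - mean(x)) / sd(x), column by column; a quick sketch verifying that for one variable:

```r
# Sketch: scale() centers by the column mean and divides by its standard deviation
library(MASS)
data("Boston")
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- unname(scale(Boston)[, "crim"])
all.equal(scaled, manual)  # TRUE
```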
After scaling, the ranges of all the variables have shrunk: most first and third quartiles lie between (or near) -1 and 1. The next step is to create a categorical crime variable, with each category indicating a level of crime. The original variable ‘crim’ will then be removed and the categorical variable added as a new column.
# create data frame object
boston_scaled <- as.data.frame(boston_scaled)
# create a categorical variable 'crime' with cut points as quantiles of the variable 'crim'
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Let’s create the train data by randomly selecting 80% of the rows of the data. The remaining rows will be treated as test data.
# choose randomly 80% of the rows for training set
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
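Note that `sample()` draws a different split on every run. A small sketch of making the split reproducible (the `set.seed` value is my own choice, not from the original analysis):

```r
set.seed(123)                        # fix the RNG so the 80/20 split is reproducible
n <- 506                             # number of rows in the Boston data
ind <- sample(n, size = floor(n * 0.8))
train_rows <- length(ind)            # 404 rows end up in the training set
```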
Linear discriminant analysis is performed on the train data, using crime as the target variable and all other variables as predictors. After the LDA fit is obtained, the biplot is drawn.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2648515 0.2500000 0.2376238 0.2475248
##
## Group means:
## zn indus chas nox rm age
## low 1.0411093 -0.9021842 -0.1251477505 -0.8847044 0.40705116 -0.8865465
## med_low -0.0833036 -0.2819395 0.0005392655 -0.5730096 -0.10985904 -0.3410669
## med_high -0.3871935 0.2004107 0.1787970010 0.4091306 0.04385536 0.4041791
## high -0.4872402 1.0171519 -0.1148450583 1.0581412 -0.39225400 0.8234116
## dis rad tax ptratio black lstat
## low 0.9165942 -0.6974378 -0.7426399 -0.41762332 0.38119357 -0.7516720
## med_low 0.3466853 -0.5406779 -0.4979925 -0.02930992 0.32091020 -0.1609564
## med_high -0.3803789 -0.4925764 -0.3679189 -0.34561698 0.07515175 0.0204888
## high -0.8428566 1.6377820 1.5138081 0.78037363 -0.76621760 0.9513763
## medv
## low 0.4874449
## med_low 0.0085978
## med_high 0.1667764
## high -0.7788111
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09932684 0.660082422 -0.924446780
## indus 0.20684196 -0.173509150 0.355125653
## chas -0.03474956 -0.074291398 0.125240743
## nox 0.26876697 -0.748634714 -1.377750333
## rm 0.07723357 0.012536636 -0.123701216
## age 0.08442120 -0.325512288 -0.120973431
## dis -0.08550872 -0.218908959 0.157590088
## rad 4.49240099 0.992157796 0.006802416
## tax 0.15738914 -0.102939131 0.427837042
## ptratio 0.10718133 -0.003208959 -0.226272327
## black -0.02962752 0.097919438 0.165267471
## lstat 0.21600800 -0.222397929 0.317757617
## medv 0.03067033 -0.494810214 -0.193725250
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9702 0.0230 0.0068
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  # draw one arrow per variable, from the origin to its scaled (LD1, LD2) coefficients
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  # label each arrow with the variable name
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
‘rad’ has the longest arrow, so it is the strongest discriminator among all the predictors: ‘rad’ separates the crime categories best.
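An arrow's length in the biplot is simply the Euclidean norm of the variable's (LD1, LD2) coefficients. A quick check using a few coefficients copied (rounded) from the `lda.fit` output above:

```r
# (LD1, LD2) coefficients, rounded, from the printed lda.fit object
coefs <- rbind(zn  = c(0.0993,  0.6601),
               nox = c(0.2688, -0.7486),
               rad = c(4.4924,  0.9922),
               tax = c(0.1574, -0.1029))
arrow_len <- sqrt(rowSums(coefs^2))  # length of each biplot arrow
names(which.max(arrow_len))          # "rad" has by far the longest arrow
```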
We would like to predict the ‘crime’ categories of the test data, so first save a copy of the true ‘crime’ categories in ‘correct_classes’ and then delete the ‘crime’ column from the test data.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 7 0 0
## med_low 5 16 4 0
## med_high 0 9 16 5
## high 0 0 0 27
The cross tabulation shows that the category ‘high’ is predicted very well in the test data (all 27 correct). ‘med_high’ is predicted worst: out of the 30 ‘med_high’ observations in the test data, only 16 are correctly classified. The ‘low’ and ‘med_low’ categories are predicted correctly about two-thirds of the time.
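These per-class rates can be computed directly from the cross tabulation. The matrix below is copied from the table above:

```r
# confusion matrix copied from the cross-tabulation printed above
cm <- matrix(c(13,  7,  0,  0,
                5, 16,  4,  0,
                0,  9, 16,  5,
                0,  0,  0, 27),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
per_class <- diag(cm) / rowSums(cm)  # recall for each true class
overall   <- sum(diag(cm)) / sum(cm) # overall accuracy, about 0.71
```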
Let’s consider the Boston data again and compute distance matrices, such as the Euclidean and Manhattan distances, on the scaled Boston data.
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of euclidean distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')
# look at the summary of manhattan distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Even on the scaled data, the two distance measures have very different ranges. Let’s perform k-means clustering with 3 clusters.
# k-means clustering
km3 <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km3$cluster)
Overall, the pairwise plots show groups of 3 colors, but not all 3 colors are visible in every panel, and sometimes the points do not form clearly separated clusters.
To find an appropriate number of clusters, try each value of k from 1 to 10.
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot shows a large improvement when going from 1 cluster to 2, and little further improvement after that. If a very small total within-cluster sum of squares is important, more clusters could be used, but having many clusters is not useful if the groups do not actually differ across the variables. Let’s go ahead with k = 2, which gives a clear improvement over k = 1.
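This “large drop, then plateau” reasoning can be made concrete by looking at the relative decrease in the total within-cluster sum of squares. The three well-separated toy clusters below are my own illustration, not the Boston data:

```r
set.seed(1)
# three well-separated 2-D clusters of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 5),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
twcss_toy <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
# relative improvement when moving from k to k + 1 clusters
rel_drop <- 1 - twcss_toy[-1] / twcss_toy[-6]
```

The improvement from 2 to 3 clusters is large, and from 3 to 4 it is tiny, correctly pointing at k = 3 for this toy data.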
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
Almost all the panels show a good separation of the 2 groups by color. Hence, k = 2 was a good choice.
Let’s use k = 4 to generate clusters, and then perform LDA to see how well these 4 clusters are separated by linear discriminant analysis.
# k-means clustering
km4 <- kmeans(boston_scaled, centers = 4)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km4$cluster)
boston_scaled1 <- data.frame(boston_scaled)
# add the k-means cluster labels as a new column (mutate and %>% come from dplyr)
boston_scaled1 <- boston_scaled1 %>%
  mutate(cluster = km4$cluster)
# linear discriminant analysis
lda.fit4 <- lda(cluster ~ ., data = boston_scaled1)
# print the lda.fit object
lda.fit4
## Call:
## lda(cluster ~ ., data = boston_scaled1)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2727273 0.4051383 0.1185771 0.2035573
##
## Group means:
## crim zn indus chas nox rm age
## 1 0.9756821 -0.4872402 1.0939296 -0.21526964 0.9652909 -0.4642740 0.7722565
## 2 -0.3888773 -0.3531707 -0.3412963 -0.27232907 -0.4124824 -0.1646322 -0.1241405
## 3 -0.2131078 -0.3300240 0.4860617 1.49936604 0.9890166 0.1347329 0.7475175
## 4 -0.4091050 1.5479668 -1.0695170 -0.04298342 -1.0484685 0.8712179 -1.2230451
## dis rad tax ptratio black lstat
## 1 -0.8278800 1.4598695 1.4701648 0.8275348 -0.71273723 0.93140384
## 2 0.1406541 -0.5964345 -0.6080622 0.1676730 0.31637467 -0.15551283
## 3 -0.7281887 -0.2889628 -0.1777283 -1.2881927 -0.04951573 0.01817238
## 4 1.2534434 -0.6005355 -0.6559834 -0.6920507 0.35409586 -0.94897033
## medv
## 1 -0.78074928
## 2 -0.08063264
## 3 0.47647538
## 4 0.92897641
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim 0.012656810 0.02428702 -0.14961045
## zn 0.095335547 0.40053401 -1.13758796
## indus -0.504615517 -0.19850861 -0.24468003
## chas 0.157622073 -0.86698142 -0.33492529
## nox 0.225995414 -0.89706014 -0.52034056
## rm 0.021371579 0.30917185 -0.33017316
## age -0.167677382 -0.43903833 0.37200499
## dis 0.362302731 -0.13797266 -0.43019018
## rad -1.477515958 0.41995946 -0.16247733
## tax -0.832279137 0.18854689 -0.61346017
## ptratio 0.007751927 0.89661450 0.30890506
## black 0.005638473 0.08388227 0.07581497
## lstat -0.199096514 0.28003158 -0.34738897
## medv 0.217842616 -0.11520682 -0.55811203
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.6775 0.1791 0.1435
# target classes as numeric
classes <- as.numeric(boston_scaled1$cluster)
# plot the lda results
plot(lda.fit4, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit4, myscale = 2)
Variables with longer arrows in the plot are the best linear discriminants for these 4 groups.
Finally, let’s project the train data onto the LDA discriminants and draw an interactive 3D scatter plot of the projection (this uses the plotly package), coloring the points by crime category.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
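The matrix product works because the dimensions are conformable: a 404 × 13 data matrix times the 13 × 3 scaling matrix gives one 3-D point per observation. The same pattern in miniature (my own toy example):

```r
X <- matrix(rnorm(12), nrow = 4)  # 4 "observations" x 3 "variables"
W <- matrix(rnorm(6),  nrow = 3)  # 3 x 2 scaling matrix, like lda.fit$scaling
P <- X %*% W                      # 4 x 2: one projected point per observation
dim(P)
```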